Abstract

We are developing an automatic method to 
compile an encyclopedic corpus from the 
Web. In our previous work, paragraph-style 
descriptions for a term are extracted from 
Web pages and organized based on domains. 
However, these descriptions are independent 
and do not comprise a condensed text as 
in hand-crafted encyclopedias. To resolve 
this problem, we propose a summarization 
method, which produces a single text from 
multiple descriptions. The resultant sum- 
mary concisely describes a term from dif- 
ferent viewpoints. We also show the effec- 
tiveness of our method by means of experi- 
ments. 

1 Introduction 

Term descriptions, which have been carefully orga- 
nized in hand-crafted encyclopedias, are valuable 
linguistic knowledge for human usage and compu- 
tational linguistics research. However, due to the 
limitation of manual compilation, existing encyclo- 
pedias often lack new terms and new definitions for 
existing terms. 

The World Wide Web (the Web), which contains 
an enormous volume of up-to-date information, is a 
promising source to obtain new term descriptions. 
It has become fairly common to consult the Web for 
descriptions of a specific term. However, the use of 
existing search engines is associated with the follow- 
ing problems: 

(a) search engines often retrieve extraneous pages 
not describing a submitted term, 

(b) even if desired pages are retrieved, a user has to 
identify page fragments describing the term, 

(c) word senses are not distinguished for polyse-
mous terms, such as "hub (device and center)",

(d) descriptions in multiple pages are independent 
and do not comprise a condensed and coherent 
text as in existing encyclopedias. 

The authors of this paper have been resolving 
these problems progressively. For problems (a) and 
(b), Fujii and Ishikawa (2000) proposed an auto- 
matic method to extract term descriptions from the
Web. For problem (c), Fujii and Ishikawa (2001)
improved the previous method, so that the multiple 
descriptions extracted for a single term are catego- 
rized into domains and consequently word senses are 
distinguished. 

Using these methods, we have compiled an ency- 
clopedic corpus for approximately 600,000 Japanese 
terms. We have also built a Web site called "Cyclone" 1
to utilize this corpus, in which one or more
paragraph-style descriptions extracted from differ-
ent pages can be retrieved in response to a user
input. In Figure 1, three paragraphs describing
"XML" are presented with the titles of their source 
pages. 

However, the above-mentioned problem (d) re- 
mains unresolved and this is exactly what we intend 
to address in this paper. 

In hand-crafted encyclopedias, a single term is de- 
scribed concisely from different "viewpoints", such 
as the definition, exemplification, and purpose. In 
contrast, if the first paragraph in Figure 1 is not
described from a sufficient number of viewpoints 
for XML, a user has to read remaining paragraphs. 
However, this is inefficient, because the descriptions 
are extracted from independent pages and usually 
include redundant contents. 

To resolve this problem, we propose a summariza- 
tion method that produces a concise and condensed 
term description from multiple paragraphs. As a re- 
sult, a user can obtain sufficient information about a 
term with a minimal cost. Additionally, by reducing 
the size of descriptions, Cyclone can be used with 
mobile devices, such as PDAs. 

However, while Cyclone includes various types 
of terms, such as technical terms, events, and an-
imals, the required set of viewpoints can vary de-
pending on the type of target terms. For example, the
definition and exemplification are necessary for tech- 
nical terms, but the family and habitat are necessary 
for animals. In this paper, we target Japanese tech- 
nical terms in the computer domain. 

Section 2 outlines Cyclone. Sections 3 and 4 ex-
plain our summarization method and its evaluation,
respectively. In Section 5, we discuss related work
and the scalability of our method.



1 http://cyclone.slis.tsukuba.ac.jp/

Figure 1: Example descriptions for "XML".



2 Overview of Cyclone 

Figure 2 depicts the overall design of Cyclone,
which produces an encyclopedic corpus by means of
five modules: "term recognition", "extraction", "retrieval",
"organization", and "related term extraction".
While Cyclone produces a corpus off-line,
users search the resultant corpus for specific descrip- 
tions on-line. 

It should be noted that the summarization
method proposed in this paper is not included in
Figure 2 and that the concept of viewpoint has not
been used in the modules in Figure 2.

In the off-line process, the input terms can be ei- 
ther submitted manually or collected by the term 
recognition module automatically. The term recog- 
nition module periodically searches the Web for mor- 
pheme sequences not included in the corpus, which 
are used as input terms. 

The retrieval module exhaustively searches the 
Web for pages including an input term, as performed 
in existing Web search engines. 

The extraction module analyzes the layout (i.e., 
the structure of HTML tags) of each retrieved page 
and identifies the paragraphs that potentially de- 
scribe the target term. While promising descrip- 
tions can be extracted from pages resembling on-line 
dictionaries, descriptions can also be extracted from 
general pages. 



The organization module classifies the multiple 
paragraphs for a single term into predefined domains 
(e.g., computers, medicine, and sports) and sorts 
them according to the score. The score is computed 
by the reliability determined by hyper-links as in 
Google 2 and the linguistic validity determined by a 
language model produced from an existing machine- 
readable encyclopedia. Thus, different word senses, 
which are often associated with different domains, 
can be distinguished and high-quality descriptions 
can be selected for each domain. 

Finally, the related term extraction module 
searches top-ranked descriptions for terms strongly 
related to the target term (e.g., "cable" and "LAN" 
for "hub"). Existing encyclopedias often provide re- 
lated terms for each headword, which are effective 
to understand the headword. In Cyclone, related 
terms can also be used as feedback terms to nar- 
row down the user focus. However, this module is 
beyond the scope of this paper. 

3 Summarization Method 
3.1 Overview 

Given a set of paragraph-style descriptions for a sin- 
gle term in a specific domain (e.g., descriptions for 
"hub" in the computer domain), our summarization 

2 http://www.google.com/

Figure 2: Overall design of Cyclone.



method produces a concise text describing the term 
from different viewpoints. 

These descriptions are obtained by the organiza-
tion module in Figure 2. Thus, the related term
extraction module is independent of our summariza- 
tion method. 

Our method is a form of multi-document summarization
(MDS) (Mani, 2001). Because the input docu-
ments (in our case, the paragraphs for a single term)
were written by different authors and/or at different
times, the redundancy and divergence of the topics in
the input are greater than those in single-document
summarization. Thus, the recognition of similarity
and difference among multiple contents is crucial. 
The following two questions have to be answered: 

• by which language unit (e.g., words, phrases, or 
sentences) should two contents be compared? 

• by which criterion should two contents be re- 
garded as "similar" or "different"? 

The answers for these questions can be different de- 
pending on the application and the type of input 
documents. 

Our purpose is to include as many viewpoints as 
possible in a concise description. Thus, we com-
pare two contents on a viewpoint-by-viewpoint basis.
In addition, if two contents are associated with the 
same viewpoint, we determine that those contents 
are similar and that they should not be repeated in 
the summary. 

Our viewpoint-based summarization (VBS) 
method consists of the following four steps: 

1. identification, which recognizes the language 
unit associated with a viewpoint, 

2. classification, which merges the identified units 
associated with the same viewpoint into a single 
group, 



3. selection, which determines one or more repre- 
sentative units for each group, 

4. presentation, which produces a summary in a 
specific format. 

The model is similar to those in existing MDS meth- 
ods. However, the implementation of each step 
varies depending on the application. We elaborate 
on the four steps in Sections 3.2 to 3.5, respectively.

3.2 Identification 

The identification module recognizes the language 
units, each of which describes a target term from a 
specific viewpoint. However, a compound or com- 
plex sentence is often associated with multiple view- 
points. The following example is an English trans- 
lation of a Japanese compound sentence in a Web 
page. 

XML is an abbreviation for extensible 
Markup Language, and is a markup lan- 
guage. 

The first and second clauses describe XML from the 
abbreviation and definition viewpoints, respectively. 
It should be noted that because "XML" and "ex- 
tensible Markup Language" are spelled out by the 
Roman alphabet in the original sentence, the first 
clause does not provide Japanese readers with the 
definition of XML. 

To extract the language units on a viewpoint-by- 
viewpoint basis, we segment Japanese sentences into 
simple sentences. However, sentence segmentation 
remains a difficult problem and the accuracy is not 
100%. First, we analyze the syntactic dependency
structure of an input sentence with CaboCha 3. Sec-
ond, we use hand-crafted rules to extract simple sen-
tences using the dependency structure.

The simple sentences other than the first clause of-
ten lack a subject. To resolve this problem, zero
pronoun detection and anaphora resolution can be 
used. However, due to the rudimentary nature of 
existing methods, we use hand-crafted rules to com- 
plement simple sentences with the subject. 

As a result, we can obtain the following two simple 
sentences from the above-mentioned input sentence, 
in which the complemented subject is in parentheses.

• XML is an abbreviation for extensible 
Markup Language. 

• (XML) is a markup language. 

3.3 Classification 

The classification module merges the simple sen- 
tences related to the same viewpoint into a single 
group. An existing encyclopedia for technical terms 
uses approximately 30 obligatory and optional view- 
points. We selected the following 12 viewpoints for 
which typical expressions can be coded manually: 



3 http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/



definition, abbreviation, exemplification, 
purpose, synonym, reference, product, ad- 
vantage, drawback, history, component, 
function. 

We manually produced 36 linguistic patterns used 
to describe terms from a specific viewpoint. These 
patterns are regular expressions, in which specific 
morphemes are generalized into parts-of-speech or 
the special symbol representing the target term. 

We use a two-stage classification method. First, 
the simple sentences that match with a pattern 
are classified into the associated viewpoint group. 
A simple sentence that matches with patterns for 
multiple viewpoints is classified into every possible 
group. 

However, the pattern-based method fails to clas-
sify the sentences that do not match any prede-
fined pattern. Thus, in the second stage, we classify the remain-
ing sentences into the group in which the most simi-
lar sentence has already been classified. In practice,
we compute the similarity between an unclassified 
sentence and each of the classified sentences. The 
similarity between two sentences is determined by 
the Dice coefficient, i.e., the ratio of content words 
commonly included in those sentences. Sentences
that remain unclassified after both stages are
placed in the "miscellaneous" group.

In summary, our two-stage method uses prede- 
fined linguistic patterns and statistics of words. 

The following examples are English translations 
of Japanese sentences extracted in the identification 
module. These sentences can be classified into a spe-
cific group on the grounds of the underlined expres-
sions, except for sentence (e). However, in the second
stage, sentence (e) can be classified into the history 
group, because sentence (e) is most similar to sen- 
tence (c). 

(a) XML is an extensible markup language.
→ definition

(b) an abbreviation for extensible Markup Lan-
guage
→ abbreviation

(c) was advised as a standard by W3C in 1998
→ history

(d) XML is an abbreviation for Extensible Markup
Language
→ abbreviation

(e) the standard of XML was advised by W3C
→ ??? → history
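
The two-stage procedure can be illustrated with a minimal Python sketch. It is not the implementation used in Cyclone: the three regular expressions, the "<TERM>" placeholder, the English stopword list standing in for content-word detection, and the `min_similarity` threshold are all hypothetical, and the English translations (a) to (e) replace the original Japanese sentences.

```python
import re

# Hypothetical English stand-ins for the hand-crafted patterns.
# "<TERM>" plays the role of the special symbol for the target term.
PATTERNS = {
    "definition": [r"^<TERM> is an? (?!abbreviation)"],
    "abbreviation": [r"\babbreviation for\b"],
    "history": [r"\bin (19|20)\d\d\b"],
}

# Crude approximation of "content words": everything except these words.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "for", "by", "as", "in", "and"}


def content_words(sentence):
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}


def dice(a, b):
    """Dice coefficient over the content-word sets of two sentences."""
    x, y = content_words(a), content_words(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def classify(sentences, term, min_similarity=0.0):
    groups = {}
    classified = []  # (sentence, viewpoint) pairs found in stage one
    unmatched = []
    # Stage one: pattern matching; a sentence may join several groups.
    for s in sentences:
        generalized = s.replace(term, "<TERM>")
        hits = [vp for vp, pats in PATTERNS.items()
                if any(re.search(p, generalized) for p in pats)]
        if not hits:
            unmatched.append(s)
        for vp in hits:
            groups.setdefault(vp, []).append(s)
            classified.append((s, vp))
    # Stage two: join the group of the most similar classified sentence.
    for s in unmatched:
        best_vp, best_sim = "miscellaneous", min_similarity
        for other, vp in classified:
            sim = dice(s, other)
            if sim > best_sim:
                best_vp, best_sim = vp, sim
        groups.setdefault(best_vp, []).append(s)
    return groups


EXAMPLES = [
    "XML is an extensible markup language.",
    "an abbreviation for extensible Markup Language",
    "was advised as a standard by W3C in 1998",
    "XML is an abbreviation for Extensible Markup Language",
    "the standard of XML was advised by W3C",
]
GROUPS = classify(EXAMPLES, "XML")
```

Under these assumptions, sentence (e) matches no pattern but shares three of its four content words with sentence (c), so it joins the history group. A sentence with no overlap with any classified sentence falls into the miscellaneous group.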

3.4 Selection 

The selection module determines one or more rep- 
resentative sentences for each viewpoint group. The 
number of sentences selected from each group can 
vary depending on the desired size of the resultant 
summary. 



We consider the following factors to compute the 
score for each sentence and select sentences with 
greater scores in each group. 

• the number of common words included (W) 
The representative sentences should contain 
many words that are common in the group. We 
collect the frequencies of words for each group, 
and sentences including frequent words are pre- 
ferred. 

• the rank in Cyclone (R) 

As depicted in Figure 2, Cyclone sorts the re-
trieved paragraphs according to the plausibility 
as the description. Sentences in highly-ranked 
paragraphs are preferred. 

• the number of characters included (C) 

To minimize the size of a summary, short sen- 
tences are preferred. 

Because these factors are different in terms of 
the dimension, range, and polarity, we normalize 
each factor in [0,1] and compute the final score as a 
weighted average of the three factors. The weight of
each factor was determined by a preliminary study. 
In brief, the relative importance among the three 
factors is W>R>C. 
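
The scoring can be sketched as follows. The weights (0.5, 0.3, 0.2) are hypothetical values chosen only to respect W>R>C, since the actual weights are not given here. Min-max normalization, whitespace tokenization, and computing W as the sum of group frequencies over the distinct words of a sentence are likewise assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weights; the text only states that W > R > C.
WEIGHTS = {"W": 0.5, "R": 0.3, "C": 0.2}


def min_max(values, higher_is_better=True):
    """Normalize values into [0, 1]; flip the polarity if lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def score_group(group):
    """Score each (sentence, paragraph_rank) pair; rank 1 is the best rank."""
    # Number of sentences in the group that contain each word.
    freq = Counter(w for s, _ in group for w in set(s.lower().split()))
    w_raw = [sum(freq[w] for w in set(s.lower().split())) for s, _ in group]
    r_raw = [rank for _, rank in group]
    c_raw = [len(s) for s, _ in group]
    w = min_max(w_raw)                          # frequent words preferred
    r = min_max(r_raw, higher_is_better=False)  # high-ranked paragraphs preferred
    c = min_max(c_raw, higher_is_better=False)  # short sentences preferred
    return [WEIGHTS["W"] * wi + WEIGHTS["R"] * ri + WEIGHTS["C"] * ci
            for wi, ri, ci in zip(w, r, c)]


def select(group, n=1):
    """Return the n highest-scoring sentences of a viewpoint group."""
    scores = score_group(group)
    ranked = sorted(zip(scores, group), key=lambda p: p[0], reverse=True)
    return [sentence for _, (sentence, _) in ranked[:n]]
```

Flipping the polarity of R and C during normalization lets all three factors be combined so that a larger final score is always better.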

However, because the miscellaneous group in- 
cludes various viewpoints, we use a different method 
from that for the regular groups. First, we select rep- 
resentative sentences from the regular groups. Sec- 
ond, from the miscellaneous group, we select the sen- 
tence that is most dissimilar to the sentences already 
selected as representatives. We use the Dice-based 
similarity used in Section 3.3 to measure the dis-
similarity between two sentences. If we select more
than one sentence from the miscellaneous group, the 
second process is repeated recursively. 
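
A greedy reading of this procedure is sketched below. How "most dissimilar to the sentences already selected" is aggregated over several sentences is not specified, so the sketch assumes that a candidate is judged by its highest Dice similarity to any selected sentence, and the candidate with the lowest such value is taken. The whitespace tokenization over all words (rather than content words only) is a further simplification.

```python
def words(sentence):
    return set(sentence.lower().split())


def dice(a, b):
    x, y = words(a), words(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def select_miscellaneous(misc, selected, n=1):
    """Pick n sentences from misc, each the least similar to those chosen so far."""
    chosen = []
    pool = list(misc)
    for _ in range(min(n, len(pool))):
        reference = list(selected) + chosen
        # Assumed criterion: minimize the maximum similarity to the reference set.
        best = min(pool, key=lambda s: max((dice(s, r) for r in reference),
                                           default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen
```

Each newly chosen sentence is added to the reference set before the next round, which is one way to realize the repetition described above.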

3.5 Presentation 

The presentation module lists the selected sentences 
without any post-editing. Ideally, natural language 
generation is required to produce a coherent text by, 
for example, complementing conjunctions and gen- 
erating anaphoric expressions. However, a simple 
list of sentences is also useful to obtain knowledge 
about a target term. 

Figure 3 depicts an example summary produced
from the top 50 paragraphs for the term "XML". In
this figure, six viewpoint groups and the miscella- 
neous group were formed and only one sentence was 
selected from each group. The order of sentences 
presented was determined by the score computed in 
the selection module. 

While the source paragraphs consist of 11,224 
characters, the summary consists of 397 characters, 
which is almost the same length as an abstract for a 
technical paper. 

The following is an English translation of the sen-
tences in Figure 3. Here, the words spelled out by
the Roman alphabet in the original sentences are in 
italics. 
• definition: XML is an extensible markup lan-
guage (extensible Markup Language).

• abbreviation: an abbreviation for Extensible 
Markup Language (an extensible markup lan- 
guage). 

• purpose: Because XML is a standard specifi- 
cation for data representation, the data defined 
by XML can be reusable, irrespective of the up- 
per application. 

• advantage: XML is advantageous to develop- 
ers of the file maker Pro, which needs to receive 
data from the client. 

• history: was advised as a standard by W3C
(World Wide Web Consortium: a group stan-
dardizing WWW technologies) in 1998.

• reference: This book is an introduction for 
XML, which has recently been paid much at- 
tention as the next generation Internet standard 
format, and related technologies. 

• miscellaneous: In XML, the tags are enclosed 
in "<" and ">". 

Each viewpoint label or sentence is hyper-linked to 
the associated group or the source paragraph, re- 
spectively, so that a user can easily obtain more in- 
formation on a specific viewpoint. For example, by
following the link of the reference sentence, a catalogue page of the book
in question can be retrieved.

Although the resultant summary describes XML 
from multiple viewpoints, there is room for im-
provement. For example, the sentences classified
into the definition and abbreviation viewpoints in- 
clude almost the same content. 

4 Evaluation 
4.1 Methodology 

Existing methods for evaluating summarization 
techniques can be classified into intrinsic and extrin- 
sic approaches. 

In the intrinsic approach, the content of a sum- 
mary is evaluated with respect to the quality of a 
text (e.g., coherence) and the informativeness (i.e., 
the extent to which important contents are in the 
summary) . In the extrinsic approach, the evaluation 
measure is the extent to which a summary improves 
the efficiency of a specific task (e.g., relevance judg- 
ment in text retrieval). 

In DUC 4 and NTCIR 5, both approaches have
been used to evaluate summarization methods tar-
geting newspaper articles. However, because there
were no public test collections targeting term descrip-
tions in Web pages, we produced our own test collection.

4 http://duc.nist.gov/

5 http://research.nii.ac.jp/ntcir/index-en.html



As the first step of our summarization research, we 
addressed only the intrinsic evaluation. 

In this paper, we focused on including as many 
viewpoints (i.e., contents) as possible in a summary, 
but did not address the text coherence. Thus, we 
used the informativeness of a summary as the evalu- 
ation criterion. We used the following two measures, 
which are in a trade-off relation.

• compression ratio

$$\text{compression ratio} = \frac{\#\text{characters in summary}}{\#\text{characters in Cyclone result}}$$

• coverage

$$\text{coverage} = \frac{\#\text{viewpoints in summary}}{\#\text{viewpoints in Cyclone result}}$$

Here, "#viewpoints" denotes the number of view-
point types. Even if a summary contains multiple
sentences related to the same viewpoint, the numer-
ator is increased by only 1.
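
As a minimal sketch, the two measures can be computed as follows. The function names and the representation of viewpoints as lists of label strings are assumptions for illustration.

```python
def compression_ratio(summary, source_paragraphs):
    """Characters in the summary divided by characters in the Cyclone result."""
    return len(summary) / sum(len(p) for p in source_paragraphs)


def coverage(summary_viewpoints, source_viewpoints):
    """Share of viewpoint types in the Cyclone result that the summary covers."""
    source = set(source_viewpoints)
    # Sets ensure that a repeated viewpoint is counted only once.
    return len(set(summary_viewpoints) & source) / len(source)
```

For the "XML" example in Section 3.5, a 397-character summary of 11,224 source characters gives a compression ratio of roughly 3.5%.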

We used 15 Japanese terms in an existing computer
dictionary as test inputs. English translations of the 
test inputs are as follows: 

10BASE-T, ASCII, SQL, XML, accumu- 
lator, assembler, binary number, crossing 
cable, data warehouse, macro virus, main 
memory unit, parallel processing, resolu- 
tion, search time, thesaurus. 

To calculate the coverage, the simple sentences 
in the Cyclone results have to be associated with 
viewpoints. To reduce the subjectivity in the evalu- 
ation, for each of the 15 terms, we asked two college 
students (excluding the authors of this paper) to an- 
notate each simple sentence in the top 50 paragraphs 
with one or more viewpoints. The two annotators 
performed the annotation task independently. The 
denominators of the compression ratio and coverage
were calculated from the top 50 paragraphs.

During a preliminary study, the authors and anno- 
tators defined 28 viewpoints, including the 12 view- 
points targeted in our method. We also defined the 
following three categories, which were not considered 
as a viewpoint: 

• non-description, which was also used to anno-
tate non-sentence fragments caused by errors in
the identification module, 

• description for a word sense independent of the 
computer domain (e.g., "hub" as a center, in- 
stead of a network device), 

• miscellaneous. 

It may be argued that an existing hand-crafted 
encyclopedia can be used as the standard sum- 
mary. However, paragraphs in Cyclone often con- 
tain viewpoints not described in existing encyclope- 
dias. Thus, we did not use existing encyclopedias in 
our experiments. 



4.2 Results 

Table 1 shows the compression ratio and coverage for
different methods, in which "#Reps" and "#Chars" 
denote the number of representative sentences se- 
lected from each viewpoint group and the number 
of characters in a summary, respectively. We always 
selected five sentences from the miscellaneous group. 
The third column denotes the compression ratio. 

The remaining columns denote the coverage on 
an annotator-by-annotator basis. The columns "12
Viewpoints" and "28 Viewpoints" denote the case 
in which we focused only on the 12 viewpoints tar- 
geted in our method and the case in which all the 
28 viewpoints were considered, respectively. 

The columns "VBS" and "Lead" denote the cover- 
age obtained with our viewpoint-based summariza- 
tion method and the lead method. The lead method, 
which has often been used as a baseline method in 
past literature, systematically extracted the top N 
characters from the Cyclone result. Here, N is the
number of characters given in the second column (#Chars).

In other words, the compression ratio of the VBS 
and lead methods was standardized, and we com- 
pared the coverage of both methods. The compres- 
sion ratio and coverage were averaged over the 15 
test terms. 
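
The lead baseline is simple enough to state in a few lines. The sketch assumes that the Cyclone result is available as a list of paragraphs in ranked order and that paragraphs are concatenated without a separator; both are assumptions for illustration.

```python
def lead_summary(ranked_paragraphs, n_chars):
    """Return the first n_chars characters of the ranked Cyclone result."""
    return "".join(ranked_paragraphs)[:n_chars]


def matched_lead(ranked_paragraphs, vbs_summary):
    """Lead summary with exactly the same length as a given VBS summary."""
    return lead_summary(ranked_paragraphs, len(vbs_summary))
```

Fixing the length of the lead summary to that of the VBS summary is what makes the coverage values of the two methods directly comparable.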

The following observations can be derived from
Table 1.

First, in the case of "#Reps=1", the average
size of a summary was 616 characters, which is 
marginally longer than an abstract for a techni- 
cal paper. In the case of "#Reps=3", the average 
summary size was 1309 characters, which is almost 
the maximum size for a single description in hand- 
crafted encyclopedias. A summary obtained with
four sentences in each group is perhaps too long as
a term description.

Second, the compression ratio was roughly 10%, 
which is fairly good performance. It may be argued 
that the compression ratio is exaggerated. That is, 
although paragraphs ranked higher than 50 can po- 
tentially provide the sufficient viewpoints, the top 50 
paragraphs were always used to calculate the denomi-
nator of the compression ratio.

We found that the top 38 paragraphs, on average, 
contained all viewpoint types in the top 50 para- 
graphs. Thus, the remaining 12 paragraphs did not 
provide additional information. However, it is dif- 
ficult for a user to determine when to stop reading 
a retrieval result. In existing evaluation workshops, 
such as NTCIR, the compression ratio is also calcu- 
lated using the total size of the input documents. 

Third, the VBS method outperformed the lead 
method in terms of the coverage, except for the case
of "#Reps=1" focusing on the 12 viewpoints by an-
notator B. However, in general the VBS method pro-
duced more informative summaries than the lead 
method, irrespective of the compression ratio and 
the annotator. 

Table 1: Results of summarization experiments.

                              Coverage by annotator A (%)      Coverage by annotator B (%)
                Compression   12 Viewpoints   28 Viewpoints    12 Viewpoints   28 Viewpoints
#Reps  #Chars   ratio (%)     VBS     Lead    VBS     Lead     VBS     Lead    VBS     Lead
1      616      5.97          56.62   52.84   49.49   44.84    50.00   53.61   49.49   47.56
2      998      9.61          73.43   57.23   59.26   53.70    64.50   62.96   60.75   57.37
3      1309     12.61         76.04   59.29   63.13   56.44    67.83   64.81   65.22   60.84

It should be noted that although the VBS method
targets 12 viewpoints, the sentences selected from
the miscellaneous group can be related to the re-
maining 16 viewpoints. Thus, even if we focus on
the 28 viewpoints, the coverage of the VBS method
can potentially increase.

It should also be noted that not all viewpoints are
equally important. For example, in an existing en-
cyclopedia (Nagao and others, 1990) the definition,
exemplification, and synonym are regarded as the 
obligatory viewpoints, and the remaining viewpoints 
are optional. 

We investigated the coverage for the three obliga- 
tory viewpoints. We found that while the coverage 
for the definition and exemplification ranged from 
60% to 90%, the coverage for the synonym was 50% 
or less. 

A low coverage for the synonym is partially due 
to the fact that synonyms are often described with 
parentheses. However, because parentheses are used 
for various purposes, it is difficult to identify only 
synonyms expressed with parentheses. This problem 
needs to be further explored. 

5 Discussion 

The goal of our research is to automatically compile 
a high-quality large encyclopedic corpus using the 
Web. Hand-crafted encyclopedias lack new terms 
and new definitions for existing terms, and thus the 
quantity problem is crucial. The Web contains un- 
reliable and unorganized information and thus the 
quality problem is crucial. We intend to alleviate 
both problems. To the best of our knowledge, no
previous attempt has been made with a similar purpose.

Our research is related to question answering 
(QA). For example, in the TREC QA track, definition
questions are intended to provide a user with the def-
inition of a target item or person (Voorhees, 2003).
However, while the expected answer for a TREC 
question is short definition sentences as in a dic- 
tionary, we intend to produce an encyclopedic text 
describing a target term from multiple viewpoints. 

The summarization method proposed in this pa- 
per is related to multi-document summarization 
(MDS) (Mani, 2001; Radev and McKeown, 1998;
Schiffman et al., 2001). The novelty of our research
is that we applied MDS to producing a condensed 
term description from unorganized Web pages, while 
existing MDS methods used newspaper articles to 
produce an outline of an event and a biography of 
a specific person. We also proposed the concept of
viewpoint for MDS purposes.

While we targeted Japanese technical terms in the 
computer domain, our method can also be applied to 
other types of terms in different languages, without 
modifying the model. However, the set of viewpoints
and the patterns typically used to describe each view-
point need to be modified or replaced depending on the
application. Given annotated data, such as those
used in our experiments, machine learning methods 
can potentially be used to produce a set of view- 
points and patterns for a specific application. 

6 Conclusion 

To compile encyclopedic term descriptions from the
Web, we incorporated a summarization method into our
previous work. Future work includes generating a
coherent text instead of a simple list of sentences 
and performing extensive experiments including an 
extrinsic evaluation method. 